Cassava Leaf Disease Classification
- Dependencies
- Load data
- Building the model
- Train the model
- Evaluating our model
- Making predictions
- Creating a submission file
This introductory notebook walks through creating a baseline model using Tensor Processing Units (TPUs) and making a first submission to the Cassava Leaf Disease Classification competition.
TPUs with TensorFlow
We'll be using TensorFlow and Keras to build our computer vision model, and using TPUs to both train our model and make predictions.
References
This notebook was built using the following amazing resources created by Kagglers:
- Learn With Me: Getting Started with Tensor Processing Units (TPUs)
- Martin Gorner: Getting Started With 100 Flowers on TPU
- Amy Jang: TensorFlow + Transfer Learning: Melanoma
- Phil Culliton: [A Simple TF 2.1 Notebook](https://www.kaggle.com/philculliton/a-simple-tf-2-1-notebook)
import math, os, re, warnings, random
import numpy as np
import pandas as pd
import seaborn as sns
from functools import partial
from matplotlib import pyplot as plt
from sklearn.utils import class_weight
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from tensorflow import keras
import tensorflow as tf
#import efficientnet.tfkeras as efn
def seed_everything(seed=0):
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
seed = 0
seed_everything(seed)
warnings.filterwarnings('ignore')
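# `strategy` is referenced in the configuration below, but the cell that creates it
# isn't shown here. This is the usual Kaggle TPU-detection boilerplate (assumed, not
# copied from the original notebook), falling back to the default strategy when no TPU is attached.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # auto-detects a TPU on Kaggle
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()  # default strategy for CPU/GPU
print('Number of replicas:', strategy.num_replicas_in_sync)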
AUTOTUNE = tf.data.experimental.AUTOTUNE
GCS_PATH = 'gs://kds-041bbb000630fa3aaebc67ffddc6dd4b536981c56338d1734b8511fe'
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
IMAGE_SIZE = [512, 512]
CLASSES = ['0', '1', '2', '3', '4']
EPOCHS = 25
Load data
The data we're working with has been formatted into TFRecords, a format for storing a sequence of binary records. TFRecords work really well with TPUs, allowing us to send a small number of large files to the TPU for processing.
If you'd like to learn more about TFRecords and maybe even try creating them yourself, check out this TFRecords Basics notebook and corresponding video from Kaggle Data Scientist Ryan Holbrook.
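As a minimal, hypothetical sketch of what creating a TFRecord looks like (the competition data is already provided in this format; the feature names below mirror the ones we parse later, and 'some_leaf.jpg' is a placeholder filename):
def serialize_example(image_bytes, target):
    # Serialize one labeled image into the same "image"/"target" layout we parse below
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'target': tf.train.Feature(int64_list=tf.train.Int64List(value=[target])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.io.TFRecordWriter('example.tfrec') as writer:
    writer.write(serialize_example(open('some_leaf.jpg', 'rb').read(), 3))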
Because our data consists of training and test images only, we're going to split our training data into training and validation data using the train_test_split() function.
Decode the data
In the code chunk below we'll set up a series of functions that convert our images into tensors so that we can use them in our model. We'll also normalize our data: our images use a Red, Green, Blue (RGB) scale with values in the range [0, 255], and normalizing sets each pixel's value to a number in the range [0, 1].
def decode_image(image):
image = tf.image.decode_jpeg(image, channels=3)
image = tf.cast(image, tf.float32) / 255.0
image = tf.reshape(image, [*IMAGE_SIZE, 3])
return image
def read_tfrecord(example, labeled):
tfrecord_format = {
"image": tf.io.FixedLenFeature([], tf.string),
"target": tf.io.FixedLenFeature([], tf.int64)
} if labeled else {
"image": tf.io.FixedLenFeature([], tf.string),
"image_name": tf.io.FixedLenFeature([], tf.string)
}
example = tf.io.parse_single_example(example, tfrecord_format)
image = decode_image(example['image'])
if labeled:
label = tf.cast(example['target'], tf.int32)
return image, label
idnum = example['image_name']
return image, idnum
We'll use the following function to load our dataset. One of the advantages of a TPU is that it can read from many files in parallel, which is where much of its speed comes from. To capitalize on that, we want to make sure we're using data as soon as it streams in, rather than creating a data-streaming bottleneck.
def load_dataset(filenames, labeled=True, ordered=False):
ignore_order = tf.data.Options()
if not ordered:
ignore_order.experimental_deterministic = False # disable order, increase speed
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE) # automatically interleaves reads from multiple files
dataset = dataset.with_options(ignore_order) # uses data as soon as it streams in, rather than in its original order
dataset = dataset.map(partial(read_tfrecord, labeled=labeled), num_parallel_calls=AUTOTUNE)
return dataset
A note on using train_test_split()
While I used train_test_split() to create the training and validation datasets, consider exploring cross-validation instead (a quick sketch follows).
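Here's a minimal sketch of what a cross-validation split over the TFRecord filenames could look like. It's illustrative only and isn't used in the rest of this notebook; the number of folds is an arbitrary choice.
all_files = tf.io.gfile.glob(GCS_PATH + '/train_tfrecords/ld_train*.tfrec')
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(all_files)):
    fold_train = [all_files[i] for i in train_idx]
    fold_valid = [all_files[i] for i in valid_idx]
    # ...build datasets from fold_train / fold_valid and train one model per fold...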
TRAINING_FILENAMES, VALID_FILENAMES = train_test_split(
tf.io.gfile.glob(GCS_PATH + '/train_tfrecords/ld_train*.tfrec'),
test_size=0.35, random_state=5
)
TEST_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/test_tfrecords/ld_test*.tfrec')
def data_augment(image, label):
    # Thanks to the dataset.prefetch(AUTOTUNE) statement in the following function, this happens essentially for free on TPU.
    # Data pipeline code is executed on the CPU of the TPU host while the TPU itself is computing gradients.
image = tf.image.random_flip_left_right(image)
return image, label
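# Only a horizontal flip is used above; as a hedged sketch, other label-preserving
# tf.image ops could be chained inside data_augment, e.g.:
#   image = tf.image.random_flip_up_down(image)
#   image = tf.image.random_brightness(image, max_delta=0.1)
#   image = tf.image.random_saturation(image, lower=0.9, upper=1.1)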
def get_training_dataset():
dataset = load_dataset(TRAINING_FILENAMES, labeled=True)
dataset = dataset.map(data_augment, num_parallel_calls=AUTOTUNE)
dataset = dataset.repeat()
dataset = dataset.shuffle(2048)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(AUTOTUNE)
return dataset
def get_validation_dataset(ordered=False):
dataset = load_dataset(VALID_FILENAMES, labeled=True, ordered=ordered)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.cache()
dataset = dataset.prefetch(AUTOTUNE)
return dataset
def get_test_dataset(ordered=False):
dataset = load_dataset(TEST_FILENAMES, labeled=False, ordered=ordered)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(AUTOTUNE)
return dataset
def count_data_items(filenames):
    # The number of records in each TFRecord file is encoded in its filename (the digits before ".tfrec")
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
NUM_TRAINING_IMAGES = count_data_items(TRAINING_FILENAMES)
NUM_VALIDATION_IMAGES = count_data_items(VALID_FILENAMES)
NUM_TEST_IMAGES = count_data_items(TEST_FILENAMES)
print('Dataset: {} training images, {} validation images, {} (unlabeled) test images'.format(
NUM_TRAINING_IMAGES, NUM_VALIDATION_IMAGES, NUM_TEST_IMAGES))
print("Training data shapes:")
for image, label in get_training_dataset().take(3):
print(image.numpy().shape, label.numpy().shape)
print("Training data label examples:", label.numpy())
print("Validation data shapes:")
for image, label in get_validation_dataset().take(3):
print(image.numpy().shape, label.numpy().shape)
print("Validation data label examples:", label.numpy())
print("Test data shapes:")
for image, idnum in get_test_dataset().take(3):
print(image.numpy().shape, idnum.numpy().shape)
print("Test data IDs:", idnum.numpy().astype('U')) # U=unicode string
np.set_printoptions(threshold=15, linewidth=80)
def batch_to_numpy_images_and_labels(data):
images, labels = data
numpy_images = images.numpy()
numpy_labels = labels.numpy()
if numpy_labels.dtype == object: # binary string in this case, these are image ID strings
numpy_labels = [None for _ in enumerate(numpy_images)]
# If no labels, only image IDs, return None for labels (this is the case for test data)
return numpy_images, numpy_labels
def title_from_label_and_target(label, correct_label):
if correct_label is None:
return CLASSES[label], True
correct = (label == correct_label)
return "{} [{}{}{}]".format(CLASSES[label], 'OK' if correct else 'NO', u"\u2192" if not correct else '',
CLASSES[correct_label] if not correct else ''), correct
def display_one_plant(image, title, subplot, red=False, titlesize=16):
plt.subplot(*subplot)
plt.axis('off')
plt.imshow(image)
if len(title) > 0:
plt.title(title, fontsize=int(titlesize) if not red else int(titlesize/1.2), color='red' if red else 'black', fontdict={'verticalalignment':'center'}, pad=int(titlesize/1.5))
return (subplot[0], subplot[1], subplot[2]+1)
def display_batch_of_images(databatch, predictions=None):
"""This will work with:
display_batch_of_images(images)
display_batch_of_images(images, predictions)
display_batch_of_images((images, labels))
display_batch_of_images((images, labels), predictions)
"""
# data
images, labels = batch_to_numpy_images_and_labels(databatch)
if labels is None:
labels = [None for _ in enumerate(images)]
# auto-squaring: this will drop data that does not fit into square or square-ish rectangle
rows = int(math.sqrt(len(images)))
cols = len(images)//rows
# size and spacing
FIGSIZE = 13.0
SPACING = 0.1
subplot=(rows,cols,1)
if rows < cols:
plt.figure(figsize=(FIGSIZE,FIGSIZE/cols*rows))
else:
plt.figure(figsize=(FIGSIZE/rows*cols,FIGSIZE))
# display
for i, (image, label) in enumerate(zip(images[:rows*cols], labels[:rows*cols])):
title = '' if label is None else CLASSES[label]
correct = True
if predictions is not None:
title, correct = title_from_label_and_target(predictions[i], label)
dynamic_titlesize = FIGSIZE*SPACING/max(rows,cols)*40+3 # magic formula tested to work from 1x1 to 10x10 images
subplot = display_one_plant(image, title, subplot, not correct, titlesize=dynamic_titlesize)
#layout
plt.tight_layout()
if label is None and predictions is None:
plt.subplots_adjust(wspace=0, hspace=0)
else:
plt.subplots_adjust(wspace=SPACING, hspace=SPACING)
plt.show()
training_dataset = get_training_dataset()
training_dataset = training_dataset.unbatch().batch(20)
train_batch = iter(training_dataset)
display_batch_of_images(next(train_batch))
validation_dataset = get_validation_dataset()
validation_dataset = validation_dataset.unbatch().batch(20)
valid_batch = iter(validation_dataset)
display_batch_of_images(next(valid_batch))
testing_dataset = get_test_dataset()
testing_dataset = testing_dataset.unbatch().batch(20)
test_batch = iter(testing_dataset)
display_batch_of_images(next(test_batch))
Building the model
Learning rate schedule
We learned about learning rates in the Intro to Deep Learning: Stochastic Gradient Descent lesson, and here I've created a learning rate schedule mostly using the defaults in the Keras Exponential Decay Learning Rate Scheduler documentation (I did change the initial_learning_rate). You can adjust the learning rate scheduler below, and read more about the other types of schedulers available to you in the Keras learning rate schedules API.
print("Tensorflow version " + tf.__version__)
lr_scheduler = keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=1e-5,
decay_steps=10000,
decay_rate=0.9)
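As a quick sanity check (not part of the original notebook), you can query the schedule at a few steps; with staircase left at its default, ExponentialDecay computes initial_learning_rate * decay_rate ** (step / decay_steps):
for step in [0, 10000, 50000]:
    print(step, float(lr_scheduler(step)))
# 0     -> 1.0e-05
# 10000 -> 9.0e-06   (1e-5 * 0.9)
# 50000 -> ~5.9e-06  (1e-5 * 0.9 ** 5)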
Building our model
To ensure that our model is trained on the TPU, we build it inside a with strategy.scope() block.
This model was built using transfer learning, meaning that we use a pre-trained model (ResNet50) as our base model and then add a customizable head built using tf.keras.Sequential. If you're new to transfer learning I recommend setting base_model.trainable to False, but I do encourage you to change which base model you're using (more options are available in the tf.keras.applications Module documentation) as well as to iterate on the custom head.
Note that we're using sparse_categorical_crossentropy as our loss function, because we did not one-hot encode our labels.
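As a small illustration (not from the original notebook) of why integer labels work with this loss: sparse_categorical_crossentropy takes integer class labels, while categorical_crossentropy expects one-hot vectors, and both give the same value for the same prediction.
probs = tf.constant([[0.1, 0.1, 0.1, 0.6, 0.1]])
print(tf.keras.losses.sparse_categorical_crossentropy([3], probs).numpy())           # integer label
print(tf.keras.losses.categorical_crossentropy([[0., 0., 0., 1., 0.]], probs).numpy())  # one-hot label
# both print the same value, -log(0.6) ≈ 0.51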
with strategy.scope():
img_adjust_layer = tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input, input_shape=[*IMAGE_SIZE, 3])
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_model.trainable = False
model = tf.keras.Sequential([
tf.keras.layers.BatchNormalization(renorm=True),
img_adjust_layer,
base_model,
tf.keras.layers.GlobalAveragePooling2D(),
tf.keras.layers.Dense(8, activation='relu'),
#tf.keras.layers.BatchNormalization(renorm=True),
tf.keras.layers.Dense(len(CLASSES), activation='softmax')
])
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=lr_scheduler, epsilon=0.001),
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'])
train_dataset = get_training_dataset()
valid_dataset = get_validation_dataset()
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
VALID_STEPS = NUM_VALIDATION_IMAGES // BATCH_SIZE
history = model.fit(train_dataset,
steps_per_epoch=STEPS_PER_EPOCH,
epochs=EPOCHS,
validation_data=valid_dataset,
validation_steps=VALID_STEPS)
With model.summary() we'll see a printout of each of our layers, their corresponding shape, as well as the associated number of parameters. Notice that at the bottom of the printout we'll see information on the total parameters, trainable parameters, and non-trainable parameters. Because we're using a pre-trained model, we expect there to be a large number of non-trainable parameters (because the weights have already been assigned in the pre-trained model).
model.summary()
Evaluating our model
The first chunk of code is provided to show you where the variables in the second chunk of code came from. As you can see, there's a lot of room for improvement in this model, but because we're using TPUs and have a relatively short training time, we're able to iterate on our model fairly rapidly.
print(history.history.keys())
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()
history_frame.loc[:, ['sparse_categorical_accuracy', 'val_sparse_categorical_accuracy']].plot();
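If you want a per-class breakdown on the validation split, the sklearn metrics imported earlier can be applied to the model's validation predictions. Here's a minimal sketch (it isn't part of the original cells):
valid_ds = get_validation_dataset(ordered=True)
valid_labels = np.concatenate([labels.numpy() for _, labels in valid_ds])  # true labels, in file order
valid_preds = np.argmax(model.predict(valid_ds.map(lambda image, label: image)), axis=-1)
print(classification_report(valid_labels, valid_preds, labels=list(range(len(CLASSES))), target_names=CLASSES))
print(confusion_matrix(valid_labels, valid_preds))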
def to_float32(image, label):
return tf.cast(image, tf.float32), label
test_ds = get_test_dataset(ordered=True)
test_ds = test_ds.map(to_float32)
print('Computing predictions...')
test_images_ds = test_ds.map(lambda image, idnum: image)
probabilities = model.predict(test_images_ds)
predictions = np.argmax(probabilities, axis=-1)
print(predictions)
print('Generating submission.csv file...')
test_ids_ds = test_ds.map(lambda image, idnum: idnum).unbatch()
test_ids = next(iter(test_ids_ds.batch(NUM_TEST_IMAGES))).numpy().astype('U') # all in one batch
np.savetxt('submission.csv', np.rec.fromarrays([test_ids, predictions]), fmt=['%s', '%d'], delimiter=',', header='id,label', comments='')
!head submission.csv
Be aware that because this is a code competition with a hidden test set, internet and TPUs cannot be enabled on your submission notebook. Therefore TPUs will only be available for training models. For a walk-through on how to train on TPUs and run inference/submit on GPUs, see our TPU Docs.